README NGS project.
Argonauts are a proteins family involved in gene silencing, they recruit miRNAs which are small RNAs used as guides to find the target RNA and destroy it.
C.elegans has 25 different Argonaute proteines, these interact with various miRNAs.
Brown et al. 2017 generated mutants of several Argonaute proteins and sequenced their transcriptomes to compare them to the wild type.
Nevertheless, C. elegans being a fast developing organism, individuals from a sample can have various ages. Therefore there is an impact of the development stage in the variation observed between strains. A cutting edge tool will be used to assess the degree of impact of development on the results.
Ubuntu 20.04 LTS.
Link to the data (Geo database, accession code: GSE98935) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98935
Strains (Subset of the article’s data): WT, alg-2(ok304) II (Argonaute 2 mutant) and alg-5(ram2) I (Argonaute 5 mutant). There are 3 replicates of each strain.
Data was downloaded using the Download_data.sh script. Using the fastq-dump command.
Fasta def lines were changed (defline-seq and defline-qual):
The sequence defline contain: The accession number, the spot id and the read number
The quality defline contain à + symbol.
We used Fastqc.sh (one report per sample). Principal command:
fastqc $file -o . -t 6
We used Multiqc.sh (on Fastqc directory output, one report for all samples). Principal command:
multiqc $data/* -o .
Adapters sequences are present in our reads, that’s why in the next step we will remove them.
multiQC report on raw RNA-seq data
Adapters’ sequences were found in the supplementary data of Brown et al, 2017.
We generated reverse complement sequences using the online tool: http://reverse-complement.com/
Sequences were written on a Fasta file: Adapter_sequences.fa
We used Trimmomatic.sh. Principal command:
java -jar $TRIMMOMATIC_JAR PE -phred33 $input_1 $input_2 $output_1_paired \
$output_1_unpaired $output_2_paired $output_2_unpaired ILLUMINACLIP:$adapter:2:30:10\
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15
ILLUMINACLIP remove adapters, LEADING remove leading low quality bases, TRAILING remove trailing low quality bases, SLIDINGWINDOW scan the read with a 4-base sliding window and cut when the average quality per base drops below 15.
Trimmomatic generate two types of outputs, paired sequences (R1+R2) and unpaired sequences (R1 or R2 alone). For the rest of the analysis we will only keep paired sequences.
I combined the two scripts of Fastqc.sh and Multiqc.sh to form the Control_qual_trim.sh. We see on the Multiqc report that there are no (or almost no) more adapters , and that the read lengths are not as homogeneous as before.
multiQC report after Trimmomatic
The reference transcriptome of C.elegans was downloaded: https://www.ensembl.org/info/data/ftp/index.html
We used the wget command with this link: ftp://ftp.ensembl.org/pub/release-101/fasta/caenorhabditis_elegans/cdna/Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz
Create an index using the Transcriptome_index.sh script
We used Salmon to quantify the expression of transcripts in each sample:
Run salmon on experimental data using Salmon.sh. Principal command:
salmon quant -i $data/index_transcriptome -l A \
-1 $input_1 \
-2 $input_2 \
-p 8 --validateMappings -o $quants/${name}_quant
-l is for the library type. -p is the number of threads that will be used for quasi mapping, quantification and bootstrapping. –validateMapping use a more sensitive and accurate mapping algorithm and run an extension alignment dynamic program on the potential mappings it produces.
We used the Multiqc_after_salmon.sh script.
Ajouter captures du rapport multiqc
We used tximport to import Salmon quantifications, it creates a matrix with abundance, counts and transcript length. After that, we put the matrix in a R object, so it’s possible to load the object for future use.
The script is: Tximport.R
tx2gene = wormRef::Cel_genes[, c("transcript_name", "wb_id")] #load transcript IDs and Gene IDs from wormRef
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)
DESeq runs a statistical test on gene expression between WT and mutants. First we estimate the variance from the replicates of each condition. We use counts data with DESeq and adapted the matrix as necessary for DESeq.
DESeq_data <- DESeqDataSetFromMatrix(countData = salmon_matrix$counts, #Counts data from tximport
colData = salmon_matrix$pdata, #information about samples
design= ~ strain) #Conditions
We set the pvalue to 0.05 (as done in Brown et al. 2017).
res_alg2 <- results(dds, name="strain_alg.2.ok304..II_vs_WT", alpha=0.05)
The WT dataset is the reference, if the Log2 FoldChange is >0 then the genes are upregulated in the mutant. If Log2 FoldChange is <0 then the genes are downregulated in the mutant.
Alg2 mutant vs WT
Alg5 mutant vs WT
We saved the differentially expressed gene names in csv files.
We used wormbase enrichment tool to assess the enrichment of mutants upregulated and downregulated genes. https://wormbase.org/tools/enrichment/tea/tea.cgi
Upregulated genes: (The results differ from the article’s, they used different tools, it’s not surprising to see a variability)
Downregulated genes: (Fit the article’s results)
Upregulated genes: (No upregulated genes in the article)
Downregulated genes: (Fit the article’s results)
BiocManager::install("limma")
devtools::install_github("LBMC/RAPToR", build_vignettes = TRUE) devtools::install_github("LBMC/wormRef")
We choose: cell larval to Young Adult as the right wormRef dataset because our samples are at L4 stage and this dataset covers it. We used 500 as the resolution of the interpolated reference.
ref_larv <- prepare_refdata("Cel_larv_YA", "wormRef", n.inter = 500)
We used RAPToR with abundance data, transcripts per million (TPM).
ae_data <- ae(samp = salmon_matrix$tpm, #input gene expression matrix
refdata = ref_larv$interpGE, #interpolated gene expression
ref.time_series = ref_larv$time.series) #Reference time series
Estimated age of each condition
We used two functions for that: - One to get the indices/GExpr of the reference matching sample age estimates.
getrefTP <- function(ref, ae_obj, ret.idx = T){
idx <- sapply(ae_obj$age.estimates[,1], function(t) which.min(abs(ref$time.series - t)))
if(ret.idx)
return(idx)
return(ref$interpGE[,idx])}
refCompare <- function(samp, ref, ae_obj, fac){
ovl <- format_to_ref(samp, getrefTP(ref_larv, ae_data, ret.idx = F))
lm_samp <- lm(t(ovl$samp)~fac)
lm_ref <- lm(t(ovl$ref)~fac) #Operates a statistical test, but we only keep the log2 fold change.
return(list(samp=lm_samp, ref=lm_ref, ovl_genelist=ovl$inter.genes)) }
log1p_samp <- log1p(salmon_matrix$tpm)
rc <- refCompare(log1p_samp, ref_larv, ae_data, strain_groups)
From the output we measured correlation coefficients:
Alg2 vs Ref for all genes: 0.6105381
Alg2 vs Ref for DE genes: 0.7321606
Alg5 vs Ref for all genes: 0.09306767
Alg5 vs Ref for DE genes: 0.1953245
And we plotted the log1p Fold Change.
Alg2 vs Ref log1p Fold Change
Alg5 vs Ref log1p Fold Change